Preparing for a data analyst interview? This guide covers 30 questions you may be asked, each with a clear, concise answer. The questions span SQL, Python, statistics, and visualization tools, so working through them should build your confidence whether you are new to the field or have years of experience. Let's begin.
SQL & Database Management
1. What are JOINs in SQL? Explain different types.
JOINs let you combine rows from two or more tables based on a related column.
- INNER JOIN: Only returns matching rows.
- LEFT JOIN: All records from the left table + matched records from the right.
- RIGHT JOIN: All from the right + matches from the left.
- FULL OUTER JOIN: All records from both, matching or not.
- CROSS JOIN: Returns all possible combinations (cartesian product).
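The join types above can be tried out with a small sketch. It uses Python's built-in sqlite3 module with an in-memory database; the customers and orders tables are hypothetical and exist only for illustration. RIGHT and FULL OUTER JOIN are left out because older SQLite versions do not support them.

```python
import sqlite3

# Hypothetical tables, created in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 20.0), (12, 2, 75.0);
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# INNER JOIN: only customers that have at least one order.
inner_rows = run("""
    SELECT c.name, o.amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY o.id
""")

# LEFT JOIN: every customer; amount is NULL (None) when there is no order.
left_rows = run("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""")

# CROSS JOIN: every customer paired with every order (3 x 3 rows).
cross_count = run("SELECT COUNT(*) FROM customers CROSS JOIN orders")[0][0]

print(inner_rows)
print(left_rows)
print(cross_count)
```

Cy has no orders, so Cy appears only in the LEFT JOIN result, with None in place of an amount.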
2. How do you handle duplicate records in SQL?
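Use DISTINCT to remove duplicates from query results, or ROW_NUMBER() in a CTE to delete all but one row per group. Example (deleting through a CTE works in SQL Server; other databases typically need a DELETE with a subquery instead):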
WITH CTE AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY id) AS rn
FROM your_table
)
DELETE FROM CTE WHERE rn > 1;
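For databases that do not allow deleting through a CTE, a similar result can be reached with a subquery. The sketch below is one possible variant, run against an in-memory SQLite database through Python's sqlite3 module; the sample rows are made up, and the table and column names simply mirror the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE your_table (id INTEGER PRIMARY KEY, column1 TEXT);
    INSERT INTO your_table VALUES (1, 'a'), (2, 'a'), (3, 'b'), (4, 'a'), (5, 'b');
""")

# DISTINCT only affects what the query returns; the table is unchanged.
distinct_values = conn.execute(
    "SELECT DISTINCT column1 FROM your_table ORDER BY column1"
).fetchall()

# Number the rows within each column1 group, then delete every row
# except the first one (lowest id) of each group.
conn.execute("""
    DELETE FROM your_table
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY id) AS rn
            FROM your_table
        )
        WHERE rn > 1
    )
""")

remaining = conn.execute(
    "SELECT id, column1 FROM your_table ORDER BY id"
).fetchall()
print(distinct_values)
print(remaining)
```

The ORDER BY inside the window decides which copy survives; here it is the row with the lowest id.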
3. What is the difference between WHERE and HAVING clauses?
- WHERE: Filters rows before grouping.
- HAVING: Filters after grouping using GROUP BY.
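A short sketch of both filters in one query, using Python's sqlite3 module and a hypothetical sales table invented for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', 100), ('North', 300),
        ('South', 50),  ('South', 60),
        ('East', 500);
""")

rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount >= 60           -- row-level filter, applied before grouping
    GROUP BY region
    HAVING SUM(amount) > 200     -- group-level filter, applied after aggregation
    ORDER BY region
""").fetchall()
print(rows)
```

WHERE drops the 50 sale before any totals are computed; HAVING then drops the South group because its remaining total (60) is not above 200.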
4. How do you rank rows in SQL using window functions?
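Use RANK(), DENSE_RANK(), or ROW_NUMBER(). RANK() leaves gaps after ties, DENSE_RANK() does not, and ROW_NUMBER() gives every row a distinct number:
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS salary_rank FROM employees;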
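To see how the three functions treat ties, here is a small sketch using Python's sqlite3 module (window functions need SQLite 3.25 or newer). The employee rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ana', 90000), ('Ben', 80000), ('Cy', 80000), ('Dee', 70000);
""")

rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY salary DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC)       AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS row_num
    FROM employees
    ORDER BY salary DESC, name
""").fetchall()

for row in rows:
    print(row)
```

Ben and Cy tie on salary: both get rank 2, after which RANK() jumps to 4 for Dee while DENSE_RANK() continues with 3. ROW_NUMBER() breaks the tie using the extra name ordering.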
5. Explain normalization and denormalization in databases.
- Normalization: Breaks data into multiple tables to reduce redundancy.
- Denormalization: Combines tables for faster read performance, at the cost of some redundancy.
6. How do you identify and handle missing values in a database?
Use IS NULL to find missing data and COALESCE() or CASE to handle it:
SELECT COALESCE(column_name, 'Default') FROM table_name;
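A minimal sketch of both steps, finding the NULLs and then substituting a default, using Python's sqlite3 module. The contacts table and the 'Unknown' placeholder are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, city TEXT);
    INSERT INTO contacts VALUES (1, 'Lima'), (2, NULL), (3, 'Oslo'), (4, NULL);
""")

# Identify: count the rows where city is missing.
missing_count = conn.execute(
    "SELECT COUNT(*) FROM contacts WHERE city IS NULL"
).fetchone()[0]

# Handle: replace NULL with a placeholder in the query output only.
filled = conn.execute(
    "SELECT id, COALESCE(city, 'Unknown') FROM contacts ORDER BY id"
).fetchall()

print(missing_count)
print(filled)
```

COALESCE changes only the query result; the stored values stay NULL unless you run an UPDATE.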
7. Write an SQL query to find the second highest salary from an employee table.
SELECT MAX(salary) FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
8. What is an index, and how does it improve query performance?
An index is like a lookup table: it speeds up data retrieval, especially for columns used in WHERE, JOIN, or ORDER BY. The trade-off is extra storage and slightly slower writes.
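One way to observe the effect is to compare the query plan before and after creating an index. The sketch below does this in SQLite through Python's sqlite3 module; the table, data, and index name are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees (name, department) VALUES (?, ?)",
    [(f"emp{i}", f"dept{i % 20}") for i in range(1000)],
)

query = "SELECT name FROM employees WHERE department = 'dept7'"

def plan(sql):
    """Return the readable steps of SQLite's query plan for a statement."""
    # The last column of each EXPLAIN QUERY PLAN row holds the description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)   # expected to be a full table scan
conn.execute("CREATE INDEX idx_employees_department ON employees (department)")
plan_after = plan(query)    # expected to search via the new index

print("before:", plan_before)
print("after: ", plan_after)
```

Other databases expose the same idea through their own EXPLAIN commands, with different output formats.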
9. Explain CTE (Common Table Expressions) with an example.
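A CTE is a named, temporary result set defined with WITH. It makes complex queries more readable and can be referenced several times within the same statement: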
WITH TopSalaries AS (
SELECT name, salary FROM employees WHERE salary > 50000
)
SELECT * FROM TopSalaries WHERE name LIKE 'A%';
10. What is the difference between INNER JOIN and OUTER JOIN?
- INNER JOIN: Only matched rows.
- OUTER JOIN: Includes unmatched rows too (LEFT, RIGHT, or FULL).
Python for Data Analysis
11. What are the key libraries used in Python for data analysis?
- pandas for data manipulation
- NumPy for numerical operations
- Matplotlib and Seaborn for visualization
- scikit-learn for machine learning
- statsmodels for statistics
12. How do you handle missing data in pandas?
- Use .isnull() or .notnull() to detect
- Use .fillna() to replace
- Use .dropna() to remove
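The three steps can be sketched on a toy DataFrame. This assumes pandas and NumPy are installed; the column names, the 'Unknown' placeholder, and the choice of filling scores with the column mean are illustrative assumptions, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Dee"],
    "score": [90.0, np.nan, 75.0, np.nan],
})

# Detect: count missing values per column.
missing_per_column = df.isnull().sum()

# Replace: a placeholder for names, the column mean for scores.
filled = df.fillna({"name": "Unknown", "score": df["score"].mean()})

# Remove: drop every row that has at least one missing value.
dropped = df.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```

Both fillna() and dropna() return new objects here, so df itself is left unchanged.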
13. Explain the difference between apply(), map(), and lambda functions in pandas.
- map(): Element-wise for Series
- apply(): Works on rows/columns for DataFrames
- lambda: Anonymous function often used with apply() or map()
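A brief sketch of how the three fit together, assuming pandas is installed. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Series.map with a lambda: one value in, one value out.
df["price_label"] = df["price"].map(lambda p: "high" if p >= 20 else "low")

# DataFrame.apply with axis=1: the lambda receives each row as a Series.
df["total"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# DataFrame.apply with the default axis=0: the lambda receives each column.
col_ranges = df[["price", "qty"]].apply(lambda col: col.max() - col.min())

print(df)
print(col_ranges)
```

For simple arithmetic like the total above, the vectorized form df["price"] * df["qty"] is usually faster than a row-wise apply(); the lambda version is shown only to illustrate the mechanics.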
14. How do you merge and join datasets in pandas?
Use merge(), join(), or concat():
df1.merge(df2, on='id', how='inner')
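The how argument works much like the SQL join types. The sketch below, which assumes pandas is installed and uses two small made-up DataFrames, compares a few options and shows concat() for stacking rows.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

# Only ids present in both frames.
inner = df1.merge(df2, on="id", how="inner")

# Every row of df1; score is NaN where df2 has no match.
left = df1.merge(df2, on="id", how="left")

# Every id from either frame; indicator=True adds a _merge column
# saying where each row came from.
outer = df1.merge(df2, on="id", how="outer", indicator=True)

# concat stacks frames on top of each other instead of matching keys.
stacked = pd.concat([df1, df1], ignore_index=True)

print(inner)
print(left)
print(outer)
print(len(stacked))
```

The _merge column is handy for checking how many rows failed to match before deciding which join type to keep.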
15. Write a Python function to find the mean and median of a dataset.
import numpy as np

def mean_median(data):
    # Return the mean and median of a sequence of numbers as a tuple
    return np.mean(data), np.median(data)
16. Explain the difference between list comprehension and a for loop.
- List comprehension is shorter and more Pythonic:
squares = [x**2 for x in range(10)]
- For loop is more flexible for complex logic.
17. What are NumPy arrays, and how do they differ from Python lists?
NumPy arrays store elements of a single data type and support vectorized operations, which makes numerical work much faster than looping over a Python list. Lists, by contrast, can mix types.
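A small sketch of the difference in behavior, assuming NumPy is installed; the values are arbitrary.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]
arr = np.array(prices)

# A list needs an explicit loop (or comprehension) for element-wise math.
doubled_list = [p * 2 for p in prices]

# An array applies the operation to every element at once.
doubled_arr = arr * 2

# With a list, * repeats the sequence instead of multiplying the values.
repeated = prices * 2

# Mixed input is converted to one common type (here float64).
mixed = np.array([1, 2.5, 3])

print(doubled_list, doubled_arr, repeated, mixed.dtype)
```

The single shared dtype is what lets NumPy run operations in compiled code rather than one Python object at a time.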
18. How do you visualize data using Matplotlib and Seaborn?
Use plt.plot() and plt.hist() from Matplotlib, and sns.barplot() and sns.heatmap() from Seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
# df is assumed to be a DataFrame with 'Category' and 'Value' columns
sns.boxplot(data=df, x='Category', y='Value')
plt.show()
Statistics & Probability
19. What is the difference between descriptive and inferential statistics?
- Descriptive: Summarizes data (mean, median, std. dev.)
- Inferential: Draws conclusions about a population from sample data (confidence intervals, hypothesis testing)
20. Explain the concept of p-value in hypothesis testing.
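The p-value is the probability of seeing data at least as extreme as what you observed, assuming the null hypothesis is true. A low p-value (commonly below 0.05) is treated as evidence to reject the null; it is not the probability that the null is true.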
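As a worked illustration, consider testing whether a coin is fair. The sketch below computes an exact two-sided p-value from the binomial distribution using only the standard library; the function name and the 16-heads-in-20-flips scenario are made up for this example.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """Exact two-sided p-value under the null hypothesis of a fair coin.

    Sums the probability of every outcome that is at least as far from
    the expected count (flips / 2) as the observed number of heads.
    """
    expected = flips / 2
    observed_distance = abs(heads - expected)
    extreme = [
        k for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    ]
    return sum(comb(flips, k) for k in extreme) / 2 ** flips

p = two_sided_p_value(16, 20)
print(round(p, 4))  # about 0.0118
```

With 16 heads in 20 flips, the p-value is below 0.05, so a fair coin would rarely produce a result this lopsided and the null hypothesis would be rejected at that threshold.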
21. What is correlation vs. causation?
- Correlation: Variables move together.
- Causation: One variable causes the other.
- Correlation doesn't imply causation.
22. How do you check if a dataset follows a normal distribution?
- Plot a histogram or Q-Q plot
- Use tests like Shapiro-Wilk or Kolmogorov-Smirnov
23. Explain the central limit theorem and its significance.
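The CLT says that the sampling distribution of the mean approaches a normal distribution as the sample size grows, even if the population itself isn't normal (provided it has finite variance). It's why many statistical methods that assume normality, such as confidence intervals and t-tests, work on sample means.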
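A quick simulation can make this concrete. The sketch below draws repeated samples from a strongly skewed exponential distribution using only the standard library; the sample sizes, the number of repetitions, and the mean-minus-median gap used as a rough skewness check are all arbitrary choices for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def sample_means(sample_size, n_samples=5000):
    """Means of n_samples samples drawn from an exponential population.

    The exponential distribution with rate 1 is right-skewed,
    with mean 1 and standard deviation 1.
    """
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

def skew_gap(values):
    """Mean minus median: near zero for a symmetric distribution."""
    return statistics.fmean(values) - statistics.median(values)

means_n2 = sample_means(2)
means_n50 = sample_means(50)

# Theory predicts a spread of about 1 / sqrt(50) for samples of size 50.
spread_n50 = statistics.stdev(means_n50)

print("skew gap, n=2: ", round(skew_gap(means_n2), 3))
print("skew gap, n=50:", round(skew_gap(means_n50), 3))
print("spread,   n=50:", round(spread_n50, 3))
```

The means of larger samples should show a much smaller gap between mean and median, which is a sign that their distribution is becoming symmetric and closer to normal.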
24. What is the difference between Type I and Type II errors?
- Type I (False Positive): Rejecting a true null hypothesis
- Type II (False Negative): Failing to reject a false null
Data Visualization & BI Tools
25. What are the best practices for creating effective dashboards?
- Keep it simple and clean
- Use the right charts
- Highlight KPIs
- Use filters for interaction
- Tell a story with data
26. How do you decide between a bar chart and a line chart?
- Use bar charts for comparing categories
- Use line charts to show trends over time
27. What are the advantages of Power BI over Excel?
- Better handling of large data
- Interactive dashboards
- Easier data refresh & modeling
- Built-in DAX for advanced calculations
28. Explain the difference between measures and calculated columns in Power BI.
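- Calculated column: Adds a new column to the table; computed row by row when the data is refreshed and stored in the model
- Measure: A dynamic calculation evaluated at query time in the current filter context and used in visuals (e.g., sum, average)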
29. What are DAX functions, and how are they used in Power BI?
DAX (Data Analysis Expressions) is used for writing formulas. Example:
Total Sales = SUM(Sales[Amount])
30. How do you handle large datasets efficiently in visualization tools?
- Use data aggregations
- Load only necessary columns
- Optimize data types
- Use filters and page-level visuals
- Avoid high-cardinality columns in visuals